NSF PAR Search | NSF Public Access Repository

Free-Grained Hierarchical Recognition

Park, Seulki; Wang, Zilin; Yu, Stella X (June 2026, IEEE/CVF Conference on Computer Vision and Pattern Recognition)

Hierarchical image recognition seeks to predict class labels along a semantic taxonomy, from broad categories to specific ones, typically under the tidy assumption that every training image is fully annotated along its taxonomy path. Reality is messier: A distant bird may be labeled only bird, while a clear close-up may justify bald eagle. We introduce free-grain training, where labels may appear at any level of the taxonomy and models must learn consistent hierarchical predictions from incomplete, mixed-granularity supervision. We build benchmark datasets with varying label granularity and show that existing hierarchical methods deteriorate sharply in this setting. To make up for missing supervision, we propose two simple solutions: One adds broad text-based supervision that captures visual attributes, and the other treats missing labels at specific taxonomy levels as a semi-supervised learning problem. We also study free-grained inference, where the model chooses how deep to predict, returning a reliable coarse label when a fine-grained one is uncertain. Together, our task, datasets, and methods move hierarchical recognition closer to the way labels arise in the real world. Our dataset and code are available at https://github.com/pseulki/FreeGrainLearning.
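
The idea of free-grained inference described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the function name `free_grained_predict`, the fixed confidence threshold, and the stop-at-first-uncertain-level rule are all assumptions made for illustration.

```python
from typing import List, Tuple


def free_grained_predict(
    path: List[Tuple[str, float]], threshold: float = 0.8
) -> str:
    """Pick how deep to predict along one taxonomy path.

    `path` lists (label, confidence) pairs from coarsest to finest.
    Hypothetical rule: always return at least the coarsest label, then
    descend while each finer level stays at or above `threshold`.
    """
    chosen = path[0][0]
    for label, confidence in path[1:]:
        if confidence < threshold:
            break  # finer label is too uncertain; keep the coarser one
        chosen = label
    return chosen


# A distant bird: the fine-grained level is uncertain, so stop early.
distant = [("bird", 0.99), ("raptor", 0.90), ("bald eagle", 0.40)]
# A clear close-up: every level is confident, so predict the leaf.
close_up = [("bird", 0.99), ("raptor", 0.97), ("bald eagle", 0.95)]

print(free_grained_predict(distant))
print(free_grained_predict(close_up))
```

A single global threshold is the simplest possible choice; a per-level threshold or a calibrated confidence would be natural alternatives.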
Full Text Available
GeoSANE: Learning Geospatial Representations From Models, Not Data

Hanna, Joelle; Falk, Damian; Yu, Stella X; Borth, Damian (June 2026, IEEE/CVF Conference on Computer Vision and Pattern Recognition)

Recent advances in remote sensing have led to an increase in the number of available foundation models, each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models and can generate new neural network weights on demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks.
Full Text Available
Unified Humanoid Fall-Safety Policy from A Few Demonstrations

Xu, Zhengjie; Li, Ye; Lin, Kwan-Yee; Yu, Stella X (June 2026, IEEE International Conference on Robotics and Automation)

Falling is an inherent risk of humanoid mobility. Maintaining stability is thus a primary safety focus in robot control and learning, yet no existing approach fully averts loss of balance. When instability does occur, prior work addresses only isolated aspects of falling: avoiding falls, choreographing a controlled descent, or standing up afterward. Consequently, humanoid robots lack integrated strategies for impact mitigation and prompt recovery when real falls defy these scripts. We aim to go beyond keeping balance to make the entire fall-and-recovery process safe and autonomous: prevent falls when possible, reduce impact when unavoidable, and stand up when fallen. By fusing sparse human demonstrations with reinforcement learning and an adaptive diffusion-based memory of safe reactions, we learn whole-body behaviors that unify fall prevention, impact mitigation, and rapid recovery in one policy. Experiments in simulation and on a Unitree G1 demonstrate robust sim-to-real transfer, lower impact forces, and consistently fast recovery across diverse disturbances, pointing toward safer, more resilient humanoids in real environments. Videos are available at https://firm2025.github.io/.
Full Text Available
Noise-Tolerant Novel-View SAR Synthesis via Denoising Diffusion

Rahimi, Amir; Yu, Stella X (January 2026, IEEE transactions on geoscience and remote sensing)

Full Text Available
Normalize Filters! Classical Wisdom for Deep Vision

Perez, Gustavo; Yu, Stella X (December 2025, Neural Information Processing Systems)

Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency and interpretability and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same way, or in any way at all. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
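
The abstract does not state which normalization is used, so the following is only a rough sketch of the general idea, assuming L1 normalization (dividing a filter by the sum of its absolute weights) followed by a scalar scale and shift. The names `normalize_filter` and `normalized_response` and the `scale` and `shift` parameters are hypothetical, and the paper's actual formulation may differ.

```python
import numpy as np


def normalize_filter(weights: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Assumed normalization: divide by the sum of absolute weights (L1 norm)."""
    return weights / (np.abs(weights).sum() + eps)


def normalized_response(
    patch: np.ndarray, weights: np.ndarray, scale: float = 1.0, shift: float = 0.0
) -> float:
    """Response of one normalized filter on one patch, then scale and shift.

    In a network, `scale` and `shift` would be learnable parameters; here
    they are plain floats so the sketch stays self-contained.
    """
    response = float((patch * normalize_filter(weights)).sum())
    return scale * response + shift


# An unnormalized averaging filter: all weights equal to 5.
box = np.full((3, 3), 5.0)
# A flat patch of intensity 2.
flat = np.full((3, 3), 2.0)

raw = float((flat * box).sum())          # grows with the filter's magnitude
norm = normalized_response(flat, box)    # stays close to the patch intensity
print(raw, norm)
```

With the averaging filter normalized this way, a flat patch maps back to roughly its own intensity regardless of how large the raw weights are, which is the kind of consistency classical filters are designed for.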
Full Text Available
Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion

Xu, Yan; Wang, Yixing; Yu, Stella X (December 2025, Neural Information Processing Systems)

Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That's the lens we take on sparse-input novel view synthesis, not only as filling spatial gaps between widely spaced views, but also as completing a natural video unfolding through space. We recast the task as test-time natural video completion, using powerful priors from pretrained video diffusion models to hallucinate plausible in-between views. Our zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity. Our project page is at https://decayale.github.io/project/SV2CGS.
Full Text Available
Wholly Unsupervised! Segmenting Objects by Contrast and Context

Pan, Fei; Wang, Yixing; Jeon, Sangryul; Yu, Stella X (December 2025, Neural Information Processing Systems Workshop on Space in Vision, Language, and Embodied AI)

We study unsupervised whole object segmentation: identifying complete objects, including both distinctive and less salient parts, rather than only visually prominent fragments. Existing unsupervised methods often focus on salient regions (e.g., head but not torso), leading to incomplete object masks. Our insight is that whole objects emerge from the interplay of part-level similarity and contrastive context, both within and across images. This enables the grouping of heterogeneous regions into coherent object segments without any supervision or predefined templates. We propose Contrastive Contextual Grouping (CCG) in a three-step algorithm: 1) identify semantically similar yet visually diverse image pairs; 2) perform co-segmentation via joint graph cuts with contrastive part-context affinity; and 3) distill the results into a single-image segmentation model. CCG achieves state-of-the-art results across unsupervised saliency detection, object discovery, video object segmentation, and nuclei segmentation. Remarkably, it could even surpass SAM2, a supervised foundation model, at segmenting whole objects from box prompts.
Full Text Available
Let Humanoids Hike! Integrative Skill Development over Complex Trails

Lin, Kwan-Yee; Yu, Stella X (June 2025, IEEE/CVF Conference on Computer Vision and Pattern Recognition)

Hiking on complex trails demands balance, agility, and adaptive decision-making over unpredictable terrain. Current humanoid research remains fragmented and inadequate for hiking: locomotion focuses on motor skills without long-term goals or situational awareness, while semantic navigation overlooks real-world embodiment and local terrain variability. We propose training humanoids to hike on complex trails, driving integrative skill development across visual perception, decision making, and motor execution. We develop a learning framework, LEGO-H, that enables a vision-equipped humanoid robot to hike complex trails autonomously. We introduce two technical innovations: 1) A temporal vision transformer variant anticipates future local goals to guide movement, seamlessly integrating locomotion with goal-directed navigation. 2) Latent representations of joint movement patterns, combined with hierarchical metric learning, enable smooth policy transfer from privileged training to onboard execution. These components allow LEGO-H to handle diverse physical and environmental challenges without relying on predefined motion patterns. Experiments across varied simulated trails and robot morphologies highlight LEGO-H's versatility and robustness, positioning hiking as a compelling testbed for embodied autonomy and LEGO-H as a baseline for future humanoid development.
Full Text Available
Test-Time Canonicalization by Foundation Models for Robust Perception

Singhal, Utkarsh; Feng, Ryan; Yu, Stella X; Prakash, Atul (July 2025, International Conference on Machine Learning)

Full Text Available
